Data Collection
Data is all around us: think about weather forecasts, traffic data on
Google Maps, statistics of your favourite sports teams, or even the
number of likes on an Instagram post. Every day, we use this data to
make decisions. We use weather data to decide what to wear; traffic data
to decide which route to take to university; sports statistics to decide
if our team has a chance to win their next match; social media data to
decide if an influencer is popular or not; and much more.
Any statistical model or analysis is only as good as the input data.
No matter how sophisticated the analysis, if the input data is poor, the
output results will not be any good. This is known as the
Garbage-In-Garbage-Out (GIGO) principle.
As good data analysts, it is our responsibility to ensure that,
whenever we are involved in data collection, it is done correctly, in a
systematic, accurate, and unbiased way. The goal of statistical data
collection is to gather information that can be used to find patterns
and trends, accurately answer questions, test hypotheses, and make
evidence-based decisions.
The process of data collection is multi-faceted. It involves
identifying what we want to study, choosing the right methods to collect
the data (like surveys, experiments, observations, or finding secondary
or tertiary data sources), and ensuring the information is reliable and
representative. By learning about data collection, you will gain tools
to make informed decisions in a variety of fields, from science and
business to social issues. It will also equip you, as a data analyst, to
help others to do so.
Q: Can you think of other sources of data that you use to make daily
decisions? How reliable do you think these sources are?
Planning data collection
Before you start collecting data, it is crucial that you plan how you
are going to do it. You will need to ask yourself (at least) the
following questions:
- What is the question I am trying to answer?
- What/who is the population I am studying?
- What kind of information do I need from the population to answer
this question?
- How can I obtain a dataset that will be representative of the
information I need from this population?
- How can I obtain such a dataset in an ethical way?

Example 1: Raheem wants to open a new restaurant on
Hatfield campus. Before doing this, he wants to know what the students
think of the restaurants and fast food places already available on
campus, in terms of the cost, freshness, variety and taste of food.
Following the data collection planning questions above, he gives the
following answers:
- What is the question I am trying to answer? What are the
perceptions of students on Hatfield campus about the cost, freshness,
variety and taste of food already available on campus?
- What/who is the population I am studying? Students on
Hatfield campus.
- What kind of information do I need from the population to answer
this question? Answers from students about their perceptions of
the available food options.
- How can I obtain a dataset that will be representative of the
information I need from this population? I can conduct a survey
of students on Hatfield campus.
- How can I obtain such a dataset in an ethical way? By
obtaining the relevant permissions to conduct such a survey from the
University, and by obtaining clear consent from every student I ask,
after I have explained the purpose of the survey. I will also keep the
students’ answers anonymous.
Example 2: Tebogo is an analyst for a private
security firm operating in the Hatfield City Improvement District (CID).
Her security firm has recently employed a new strategy to combat theft.
She wants to know whether thefts have decreased since they employed the
strategy. She answers the questions as follows:
- What is the question I am trying to answer? Have thefts in
the Hatfield CID decreased since our security firm employed its new
theft-prevention strategy?
- What/who is the population I am studying? Everyone working
in, living in, or travelling through the Hatfield CID.
- What kind of information do I need from the population to answer
this question? Theft statistics from all relevant police
stations whose precincts overlap with the CID.
- How can I obtain a dataset that will be representative of the
information I need from this population? I can request crime
statistics from the relevant police stations.
- How can I obtain such a dataset in an ethical way? By
obtaining the relevant permissions from the SAPS and from the relevant
persons at the police stations themselves, and by signing any necessary
agreements about my use of the data.
Example 3: William is an ecologist who wants to
determine if a new pesticide-free anti-fungal treatment he has developed
will keep maize safe from fungal infections. Below are his answers to
the data collection planning questions:
- What is the question I am trying to answer? Does my new
anti-fungal treatment protect maize from fungal
infections?
- What/who is the population I am studying? Maize
plants.
- What kind of information do I need from the population to answer
this question? Data on the health of maize plants that were
given the treatment, and maize plants that were not given the treatment,
when exposed to fungi.
- How can I obtain a dataset that will be representative of the
information I need from this population? By planting two fields
of maize, giving one the anti-fungal treatment and leaving the other
without treatment, and then exposing them both to the
fungus.
- How can I obtain such a dataset in an ethical way? By making
sure the fungus cannot spread to any other plants or
crops.
Class Exercise Question 1: You are asked to
determine the favourite movie of first-year mathematical sciences
students in your class. Answer the data collection planning questions at
the beginning of this section to determine what your intended dataset
is, and how you will collect it.
- What is the question I am trying to answer?
- What/who is the population I am studying?
- What kind of information do I need from the population to answer
this question?
- How can I obtain a dataset that will be representative of the
information I need from this population?
- How can I obtain such a dataset in an ethical way?
Primary data collection
Primary data is collected first-hand by the researcher in order to
answer a specific question or questions. Examples of primary data
collection include conducting interviews and surveys to ask about
people’s opinions and experiences; conducting experiments in a
laboratory; collecting field data, such as animal tracking data; and
taking direct measurements (e.g. the chemistry of plants, or the weight
of animals).
Primary data is complex and resource-intensive to collect. It also
requires an in-depth understanding of the answers to the data collection
planning questions in the previous section. When you collect primary
data, it is your responsibility to ensure that the correct data is
collected in the correct way, and that the data is representative,
unbiased, and ethical. We will learn more about representative and
unbiased data in the section on evaluating data.
Exercise: For each of Examples 1-3 in the previous
section, identify the type of primary data collected (survey,
experiment, field data, or direct measurements), or indicate if it was
not primary data.
Class Exercise Question 2: Is the dataset from Class
Exercise Question 1 a primary, secondary or tertiary dataset? Can you
obtain the data through surveys, interviews, experiments, field data, or
direct measurements?
Secondary data collection
Secondary data is data that was collected by a different researcher
for a purpose that is different from the current study. Examples of
secondary data include data from the national census, marketing data
collected by a company, crime data, social media data, and more.
Secondary data collection is usually done in one of the following
ways:
- By approaching the custodian of the primary data and obtaining their
permission to use the data. This is usually done with sensitive data
like public health or crime data.
- By downloading an open dataset from the internet.
- By web scraping or using other techniques to gather data from
the internet.
Q: Can you think of other examples of secondary data, and how you
would collect them?
The main challenge when collecting secondary data is to make sure
that it is the correct data to answer your research question. Even
though secondary data was collected by someone else, you as the analyst
still need to ensure that the data is of good quality, and you remain
responsible for the ethics of the data as it pertains to your study. If
the data was collected unethically, you could still face consequences
for using it. This means that you cannot simply assume that the data is
relevant and ethical; you need to check.
Exercise: For each of Examples 1-3, if the data
collected was not primary, identify the type of secondary data
collected.
Evaluating Data
Once you have collected data (whether it is primary or secondary),
you need to be able to determine if the data is good and fit for use.
This section explains the attributes of a good dataset, and how to check
if it is relevant to the research question at hand. The key aspects of a
good dataset include relevance, quality, representativeness,
unbiasedness, and impact.
Relevance
Relevant data is data that is applicable to the research question at
hand. The analyst must be able to use this data to answer their research
question. It must also be up-to-date for the purpose of the study. It is
important to ensure that data is relevant, since irrelevant or redundant
data can clutter the analysis and reduce the efficiency of the
study.
For example, if an insurer wants to answer a question about
short-term insurance in 2024, a dataset on long-term insurance in 2024
would be irrelevant. Similarly, a dataset on short-term insurance in
1996 would be irrelevant.
In order to ensure that primary data is relevant, one should collect
only necessary data and regularly review datasets for alignment with the
research objectives. For secondary data, one should determine the scope
of the data and date of collection.
Quality
The data must be of good quality. This includes completeness
and consistency.
A dataset is complete if it has minimal missing or
incomplete data. Gaps in data can distort the analysis, or require
assumptions that may not be valid.
In order to ensure that primary data is complete, one should clearly
indicate and document missing data, and where possible, implement
strategies to fill gaps responsibly. For secondary data, one should
determine if any data is missing. If there is a large amount of missing
data, this may indicate that the dataset is not suitable. If there are
minimal missing values, one should use reliable techniques to impute
missing data without making any undue assumptions.
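
A first check for completeness is simply to count how much is missing. The sketch below is one possible way to do this in Python; the patient records, the column names and the helper missing_fraction are all made up for illustration, and None is assumed to mark a missed appointment.

```python
# Hypothetical follow-up records; None marks a missed appointment.
records = [
    {"patient": "P1", "visit_1": 120, "visit_2": 118, "visit_3": None},
    {"patient": "P2", "visit_1": 135, "visit_2": None, "visit_3": None},
    {"patient": "P3", "visit_1": 128, "visit_2": 126, "visit_3": 125},
    {"patient": "P4", "visit_1": 140, "visit_2": 138, "visit_3": 137},
]


def missing_fraction(rows: list[dict], column: str) -> float:
    """Return the share of rows in which `column` is missing (None or absent)."""
    if not rows:
        raise ValueError("no rows to inspect")
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


# Report the share of missing values for each visit.
for column in ("visit_1", "visit_2", "visit_3"):
    print(f"{column}: {missing_fraction(records, column):.0%} missing")
```

A count like this only shows where the gaps are. Whether a given share of missing values is acceptable, and how to fill the gaps, still depends on the study.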
As an example, missing data often arises in longitudinal health
studies. These studies typically attempt to determine the status of
patients over time. Missing values occur when patients do not show up
for follow-up appointments and drop out of the study without explaining
why. This can happen if they simply forget their follow-up appointments;
if they feel better, and no longer feel it is necessary to visit the
hospital; or if they move away; or for a host of other reasons. In such
studies, it is therefore crucial to educate the participating patients
on the need to attend their follow-up visits if they are able to.
A dataset is consistent if the data was recorded in a
uniform and standardised manner across the dataset. Inconsistent data
formats (e.g., different date formats or units) can complicate the
analysis and increase the risk of errors.
For consistency of primary data, it is important to use standardised
units of measurements (if applicable), standard questions with clear
instructions on how to answer them (if applicable), standardised data
formats, and data entry procedures. For secondary data, it is important
to investigate whether the data formats and units are consistent across
the dataset. If they are not consistent, this should be remedied before
the data can be used.
Inconsistencies slip into datasets more easily than one might
suppose. For example, if seven people are asked to write the date of New
Year’s Eve in 2024, they might write it as follows:
- 31/12/24
- 31 Dec 24
- 31st of Dec 2024
- 31/12/2024
- 31 December 2024
- 12/31/2024 (This one may look strange, but it follows the MM/DD/YYYY
format used in the USA. Phones or laptops are sometimes set to the USA
format by default, which can lead to entries like this.)
- 2024/12/31
To ensure consistent dates, for instance, one could provide an
example of a date (e.g. 31/12/2024) or a standard date format
(e.g. DD/MM/YYYY).
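
When the inconsistent dates are already in the dataset, they can often be converted to one standard format afterwards. The following is a minimal sketch in Python, assuming the entries are in English and take one of the seven forms listed above; the function name normalise_date and the list KNOWN_FORMATS are illustrative choices, not a standard tool.

```python
import re
from datetime import date, datetime

# Formats to try, in order. Day-first formats come before the month-first
# USA format, so an ambiguous entry such as 03/04/2024 is read as 3 April.
KNOWN_FORMATS = [
    "%d/%m/%Y",   # 31/12/2024
    "%d/%m/%y",   # 31/12/24
    "%d %b %Y",   # 31 Dec 2024
    "%d %b %y",   # 31 Dec 24
    "%d %B %Y",   # 31 December 2024
    "%Y/%m/%d",   # 2024/12/31
    "%m/%d/%Y",   # 12/31/2024 (USA format)
]


def normalise_date(text: str) -> date:
    """Parse a date written in any of the known formats."""
    # Turn "31st of Dec 2024" into "31 Dec 2024" before parsing.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    cleaned = re.sub(r"\bof\b", " ", cleaned)
    cleaned = " ".join(cleaned.split())
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # this format does not fit; try the next one
    raise ValueError(f"unrecognised date format: {text!r}")


entries = [
    "31/12/24", "31 Dec 24", "31st of Dec 2024", "31/12/2024",
    "31 December 2024", "12/31/2024", "2024/12/31",
]
for entry in entries:
    # Print every entry in the standard DD/MM/YYYY format.
    print(f"{entry!r:>20} -> {normalise_date(entry).strftime('%d/%m/%Y')}")
```

Note the limitation: the USA-style entry is only recognised here because 31 cannot be a month. An entry such as 03/04/2024 would silently be read as day first, which is why it is better to prescribe a format before the data is collected.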
Representativeness and unbiasedness
The data must accurately represent the population being studied.
The data must avoid introducing bias. Bias occurs when the data
over-represents some members of the population and under-represents
others, or when it represents some members of the population in an unduly
positive or negative light.
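
One simple way to look for over- and under-representation is to compare the share of each group in the sample with its share in the population. The sketch below illustrates the idea in Python; the year-of-study groups, the population shares and the sample are invented numbers, and share_gaps is a hypothetical helper name.

```python
from collections import Counter


def share_gaps(sample: list[str], population_shares: dict[str, float]) -> dict[str, float]:
    """Sample share minus population share for each group.

    A positive gap means the group is over-represented in the sample;
    a negative gap means it is under-represented.
    """
    if not sample:
        raise ValueError("sample is empty")
    counts = Counter(sample)
    groups = set(population_shares) | set(counts)
    return {
        group: counts.get(group, 0) / len(sample) - population_shares.get(group, 0.0)
        for group in sorted(groups)
    }


# Hypothetical shares of students per year of study in the population.
population_shares = {"first year": 0.40, "second year": 0.30, "third year": 0.30}

# Hypothetical sample of 20 respondents, recorded by year of study.
sample = ["first year"] * 14 + ["second year"] * 4 + ["third year"] * 2

for group, gap in share_gaps(sample, population_shares).items():
    print(f"{group}: {gap:+.0%}")
```

In this made-up sample, first-year students are over-represented by about 30 percentage points. A check like this only covers the groups one thinks to compare, and it cannot detect data that shows some members in an unduly positive or negative light.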
Impact
The data should not have a potentially harmful impact.
Examples
Example 4 (continuation of Example 1): Raheem has
finished collecting surveys from students about their perceptions of the
food available on campus. He inspects his dataset using the key aspects
above, and comes to the following conclusions:
- Relevance: Since this was primary data, it was collected by
the investigator to answer his specific research question. The data is
up-to-date. It is thus relevant.
- Quality: The surveys that were given to students were
standardised. All students received a copy of the same survey, with
clear instructions on how to answer each question. Thus, the dataset is
consistent. Furthermore, nearly all of the respondents answered all the
questions. Thus, the dataset is complete.
- Representativeness and unbiasedness: Students from all
across Hatfield campus were asked to fill in the survey. This included
students from different years of study, different degrees and different
faculties, as well as diverse demographic and socio-economic
backgrounds. Thus, the data represents the diverse student body on
Hatfield campus. Furthermore, surveys were handed out to students at a
variety of spots on campus, including far away from any food vendors,
and regardless of whether or not students were eating purchased food,
home-made food, or not eating at all. Thus, there was little if any
bias.
- Impact: The data should not have a potentially harmful impact.
The data was anonymised, so that students’ answers on the survey
could not be linked to their identities in any way. Any mention of
specific restaurants or food outlets was also removed, so that no
student’s opinion could be linked to any existing vendor on Hatfield
campus. Thus, there is very little chance of any potentially harmful
impact on either students or food vendors.
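
As an illustration of the anonymisation step in Example 4, the sketch below removes identifying fields from survey responses and shuffles the order of the rows. It is only a sketch under stated assumptions: the field names, the responses and the function anonymise are hypothetical, and removing direct identifiers does not by itself guarantee that nobody can be re-identified from the remaining answers.

```python
import random

# Hypothetical names of fields that identify a respondent directly.
IDENTIFYING_FIELDS = {"name", "student_number", "email"}


def anonymise(responses: list[dict], seed=None) -> list[dict]:
    """Drop identifying fields and shuffle the rows.

    Shuffling prevents the order of the rows from revealing the order in
    which students were approached. The input list is left unchanged.
    """
    stripped = [
        {key: value for key, value in row.items() if key not in IDENTIFYING_FIELDS}
        for row in responses
    ]
    random.Random(seed).shuffle(stripped)
    return stripped


# Hypothetical survey responses (scores out of 5).
responses = [
    {"name": "Student A", "student_number": "u0000001", "cost": 2, "taste": 4},
    {"name": "Student B", "student_number": "u0000002", "cost": 3, "taste": 5},
    {"name": "Student C", "student_number": "u0000003", "cost": 1, "taste": 3},
]

for row in anonymise(responses):
    print(row)
```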
Example 5 (continuation of Example 2):
- Relevance: .
- Quality: .
- Representativeness and unbiasedness: .
- Impact: .
Example 6 (continuation of Example 3):
- Relevance: .
- Quality: .
- Representativeness and unbiasedness: .
- Impact: .
Class Exercise Question 3: